GridSample operator performance improvement on bilinear interpolation…#27359
Open
melkap01-Arm wants to merge 2 commits into microsoft:main from
… mode Signed-off-by: melkap01 <melike.kaptan@arm.com>
Description
This change optimises the GridSample operator in ONNX Runtime.
1- A fast path is added for GridSample nodes with the characteristics of the camera-based 3D object detection model in the MLPerf Automotive suite: 2D interpolation with mode = linear, padding mode = zeros, and align_corners = 0, transforming input to output coordinates as follows.
Linear interpolation: For each (x, y), the code locates the four surrounding integer pixel centers:
(x1, y1) = (floor(x), floor(y)) (top-left)
(x2, y2) = (x1 + 1, y1 + 1) (bottom-right)
The interpolation weights reflect the fractional positions:
dx1 = x - x1, dx2 = x2 - x
dy1 = y - y1, dy2 = y2 - y
The resulting value is the bilinear blend dy2 * (dx2 * p11 + dx1 * p12) + dy1 * (dx2 * p21 + dx1 * p22) where p11…p22 are the input pixels at those four neighbor coordinates.
Padding mode = zeros: Any neighbor index that falls outside [0, W_in-1] × [0, H_in-1] contributes 0 to the interpolation.
Each output pixel (oy, ox) carries normalized coordinates (nx, ny) in [-1, 1]. With align_corners=0, nx = -1 corresponds to a location half a pixel before the leftmost input column (i.e., x = -0.5), and nx = 1 corresponds to half a pixel beyond the rightmost column (x = W_in - 0.5). Same idea vertically for ny.
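The coordinate mapping and blend described above can be sketched as follows. This is an illustrative reimplementation of the formulas in this description, not the PR's actual code; the function names are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Map a normalized grid coordinate in [-1, 1] to an input-pixel coordinate
// under align_corners = 0 (half-pixel convention): -1 -> -0.5, 1 -> size - 0.5.
float Denormalize(float n, int64_t size) {
  return ((n + 1.f) * size - 1.f) / 2.f;
}

// Bilinear sample with padding mode = zeros; pixel/weight names follow the text.
float BilinearZeros(const float* input, int64_t H, int64_t W, float x, float y) {
  const int64_t x1 = static_cast<int64_t>(std::floor(x));
  const int64_t y1 = static_cast<int64_t>(std::floor(y));
  const int64_t x2 = x1 + 1, y2 = y1 + 1;
  const float dx1 = x - x1, dx2 = x2 - x;
  const float dy1 = y - y1, dy2 = y2 - y;
  // Any neighbor outside [0, W-1] x [0, H-1] contributes 0.
  auto at = [&](int64_t yy, int64_t xx) -> float {
    return (xx >= 0 && xx < W && yy >= 0 && yy < H) ? input[yy * W + xx] : 0.f;
  };
  const float p11 = at(y1, x1), p12 = at(y1, x2);
  const float p21 = at(y2, x1), p22 = at(y2, x2);
  return dy2 * (dx2 * p11 + dx1 * p12) + dy1 * (dx2 * p21 + dx1 * p22);
}
```

For a 2x2 input, a grid point at the exact center (nx = ny = 0) denormalizes to (0.5, 0.5) and blends all four pixels with equal weight 0.25.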
Fast Path Optimisation: The implementation precomputes all neighbor indices and weights for each output pixel once (they depend only on the grid), then reuses them for every channel. Previously, indices and weights were recalculated inside the inner loops, i.e. up to H_out * W_out points (around 20,000 per batch element in one case) times 32 channels.
2- Optional Arm NEON vectorization: the four neighbor pixels are gathered as two pairs ([p11, p12], [p21, p22]) and blended using the following intrinsics:
- vcombine_f32(low, high): concatenates two float32x2_t values into one float32x4_t, giving [p11, p12, p21, p22].
- vdup_n_f32(val): duplicates a scalar float into both lanes of a float32x2_t.
- vset_lane_f32(val, vec, lane): writes val into the specified lane of a float32x2_t, letting us form [w11, w12] and [w21, w22].
- vmulq_f32(a, b): multiplies two float32x4_t vectors element-wise (neighbor pixels × weights).
- vget_low_f32(vec) / vget_high_f32(vec): extract the lower or upper 2 lanes from a float32x4_t as float32x2_t.
- vadd_f32(a, b): adds two float32x2_t vectors element-wise (forming partial sums).
- vpadd_f32(a, b): performs pairwise adds within and across two float32x2_t vectors, collapsing four elements down to two.
- vget_lane_f32(vec, lane): reads a scalar from a specific lane, giving the final interpolated value.
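A minimal sketch of how these intrinsics combine into the four-neighbor blend is shown below. The weight naming (w11 = dy2*dx2, w12 = dy2*dx1, w21 = dy1*dx2, w22 = dy1*dx1) is an assumption matching the scalar formula above, and the scalar fallback is added here so the sketch also builds on non-Arm targets; this is not the PR's exact code.

```cpp
#include <cassert>
#include <cmath>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Blend four neighbor pixels with their bilinear weights and reduce to one scalar.
float BlendFourNeighbors(float p11, float p12, float p21, float p22,
                         float w11, float w12, float w21, float w22) {
#if defined(__ARM_NEON)
  // Form the pairs [p11, p12] and [p21, p22], then concatenate.
  float32x2_t top = vset_lane_f32(p12, vdup_n_f32(p11), 1);
  float32x2_t bot = vset_lane_f32(p22, vdup_n_f32(p21), 1);
  float32x4_t pix = vcombine_f32(top, bot);            // [p11, p12, p21, p22]
  float32x2_t wtop = vset_lane_f32(w12, vdup_n_f32(w11), 1);
  float32x2_t wbot = vset_lane_f32(w22, vdup_n_f32(w21), 1);
  float32x4_t w = vcombine_f32(wtop, wbot);            // [w11, w12, w21, w22]
  float32x4_t prod = vmulq_f32(pix, w);                // neighbor * weight
  // Reduce: low + high lanes, then a pairwise add collapses to one value.
  float32x2_t sum2 = vadd_f32(vget_low_f32(prod), vget_high_f32(prod));
  float32x2_t sum1 = vpadd_f32(sum2, sum2);
  return vget_lane_f32(sum1, 0);
#else
  // Scalar equivalent on targets without NEON.
  return w11 * p11 + w12 * p12 + w21 * p21 + w22 * p22;
#endif
}
```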
Most of the performance uplift comes from the first optimisation; the NEON intrinsics contribute further, but to a lesser degree.
Overall performance improvement:
1 thread:

2 threads:

Motivation and Context
The fast path handles denormalisation of the linear coordinates and derives the indices by precomputing a separate plan entry per output pixel. In PrecomputeBilinearSamplePlan2D, the loop runs over all H_out * W_out points, using the nx/ny for each (oy, ox) and storing that point's four indices, four weights, and mask in plans[idx].
During evaluation, EvaluatePlanForChannel iterates over the same point_count (H_out * W_out) and uses the matching plan entry for each (oy, ox). So a plan is not reused across different spatial positions; one plan is precomputed per output location and reused only across channels, which share the same grid.
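The precompute-then-reuse structure can be sketched as below. The struct layout, function names, and the grid layout (nx then ny, interleaved per point) are assumptions for illustration and do not reproduce the PR's actual PrecomputeBilinearSamplePlan2D / EvaluatePlanForChannel signatures.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// One entry per output pixel: flattened input offsets, weights, validity mask.
struct SamplePlan {
  int64_t idx[4];    // offsets of p11, p12, p21, p22
  float w[4];        // matching bilinear weights
  bool in_range[4];  // false => neighbor outside input, contributes 0
};

// Depends only on the grid, so it runs once and is shared by all channels.
std::vector<SamplePlan> PrecomputePlan(const float* grid, int64_t H_out,
                                       int64_t W_out, int64_t H_in, int64_t W_in) {
  std::vector<SamplePlan> plans(H_out * W_out);
  for (int64_t i = 0; i < H_out * W_out; ++i) {
    const float nx = grid[2 * i], ny = grid[2 * i + 1];
    // align_corners = 0 denormalization.
    const float x = ((nx + 1.f) * W_in - 1.f) / 2.f;
    const float y = ((ny + 1.f) * H_in - 1.f) / 2.f;
    const int64_t x1 = (int64_t)std::floor(x), y1 = (int64_t)std::floor(y);
    const float dx1 = x - x1, dy1 = y - y1;
    const int64_t xs[2] = {x1, x1 + 1}, ys[2] = {y1, y1 + 1};
    const float wx[2] = {1.f - dx1, dx1}, wy[2] = {1.f - dy1, dy1};
    for (int k = 0; k < 4; ++k) {  // k orders p11, p12, p21, p22
      const int64_t xx = xs[k % 2], yy = ys[k / 2];
      plans[i].in_range[k] = xx >= 0 && xx < W_in && yy >= 0 && yy < H_in;
      plans[i].idx[k] = plans[i].in_range[k] ? yy * W_in + xx : 0;
      plans[i].w[k] = wy[k / 2] * wx[k % 2];
    }
  }
  return plans;
}

// Per-channel pass: only gathers and multiply-adds, no index math.
void EvaluateChannel(const std::vector<SamplePlan>& plans, const float* in,
                     float* out) {
  for (size_t i = 0; i < plans.size(); ++i) {
    float acc = 0.f;
    for (int k = 0; k < 4; ++k)
      if (plans[i].in_range[k]) acc += plans[i].w[k] * in[plans[i].idx[k]];
    out[i] = acc;
  }
}
```

Calling EvaluateChannel once per channel with the same plans is what amortises the floor/weight/bounds work across the channel dimension.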